Abstract

We show that Skilling's method of induction leads to a unique general 
theory of inductive inference, the method of Maximum relative Entropy 
(ME). The main tool for updating probabilities is the logarithmic relative 
entropy; other entropies such as those of Renyi or Tsallis are ruled out. 
We also show that Bayes updating is a special case of ME updating and 
thus, that the two are completely compatible. 



1 Introduction 

The method of Maximum (relative) Entropy (ME) [1, 2, 3] is designed for
updating probabilities when new information is given in the form of a constraint
on the family of allowed posteriors. This is in contrast with the older MaxEnt
method [4], which was designed to assign rather than update probabilities. The
objective of this paper is to strengthen the ME method in two ways.

In [3] the axioms that define the ME method have been distilled down to
three. In this work the justification of the method is improved by considerably
weakening the axiom that deals with independent subsystems. We adopt a
consistency axiom similar to that proposed by Shore and Johnson [1]: When two
systems are independent it should not matter whether the inference procedure
treats them separately or jointly. The merit of such a consistency axiom is that
it is very compelling. Nevertheless, the mathematical implementation of the
axiom has been criticized by Karbelkar [5] and by Uffink [6]. In their view it fails
to single out the usual logarithmic entropy as the unique tool for updating. It
merely restricts the form of the entropy to a one-dimensional continuum labeled
by a parameter $\eta$. The resulting $\eta$-entropies are equivalent to those proposed
by Renyi [7] and by Tsallis [8] in the sense that they update probabilities in the
same way.

The main result of this paper is to go beyond the insights of Karbelkar and
Uffink, and show that the consistency axiom selects a unique, universal value
for the parameter $\eta$ and this value ($\eta = 0$) corresponds to the usual logarithmic
entropy. The advantage of our approach is that it shows precisely how it is that
$\eta$-entropies with $\eta \neq 0$ are ruled out as tools for updating.

Presented at MaxEnt 2006, the 26th International Workshop on Bayesian Inference and
Maximum Entropy Methods (July 8-13, 2006, Paris, France).

Our second objective is mostly pedagogical. The preeminent updating method 
is based on Bayes' rule and we want to discuss its relation with the ME method. 
We start by drawing a distinction between Bayes' theorem, which is a straight- 
forward consequence of the product rule for probabilities, and Bayes' rule, which 
is the actual updating rule. We show that Bayes' rule can be derived as a special 
case of the ME method, a result that was first obtained by Williams [9] long
before the logical status of the ME method had been sufficiently clarified. The 
virtue of our derivation, which hinges on translating information in the form of 
data into constraints that can be processed using ME, is that it is particularly 
clear. It throws light on Bayes' rule and demonstrates its complete compati- 
bility with ME updating. A slight generalization of the same ideas shows that 
Jeffrey's updating rule is also a special case of the ME method. 

2 Entropy as a tool for updating probabilities 

Our objective is to devise a general method to update from a prior distribution 
q(x) to a posterior distribution p(x) when new information becomes available. 
By information, in its most general form, we mean a set of constraints on the 
family of acceptable posterior distributions. Information is whatever constrains 
our beliefs. 

To carry out the update we proceed by ranking the allowed probability 
distributions according to increasing preference. This immediately raises two 
questions: (a) how is the ranking implemented and (b) what makes one distri- 
bution preferable over another? The answer to (a) is that any useful ranking 
scheme must be transitive (if $P_1$ is better than $P_2$, and $P_2$ is better than $P_3$,
then $P_1$ is better than $P_3$), and therefore it can be implemented by assigning a
real number S[P] to each P in such a way that if $P_1$ is preferred over $P_2$, then
$S[P_1] > S[P_2]$. The preferred P is that which maximizes the "entropy" S[P].
This explains why entropies are real numbers and why they are meant to be 
maximized. 

Question (b), the criterion for preference, is implicitly answered once the 
functional form of the entropy S[P] that defines the ranking scheme is chosen. 
The basic strategy is inductive. We follow Skilling's method of induction [2]:
(1) If an entropy S[P] of universal applicability exists, it must apply to special 
examples. (2) If in a certain example the best distribution is known, then 
this knowledge constrains the form of S[P]. Finally, (3) if enough examples
are known, then S[P] will be completely determined. (Of course, the known 
examples might turn out to be incompatible with each other, in which case 
there is no universal S[P] that accommodates them all.) 

It is perhaps worth emphasizing that in this approach entropy is a tool for 
reasoning which requires no interpretation in terms of heat, multiplicities, dis- 
order, uncertainty, or amount of information. Entropy needs no interpretation. 
We do not need to know what it means, we only need to know how to use it. 

The known special examples, which are called the "axioms" of ME, reflect
the conviction that what was learned in the past is important and should not be
easily ignored. The chosen posterior distribution should coincide with the prior
as closely as possible and one should only update those aspects of one's beliefs
for which corrective new evidence has been supplied. The first two axioms are
listed below. (The motivation and detailed proofs are found in [3].)

Axiom 1: Locality. Local information has local effects.

When the new information does not refer to a domain D of the variable x the
conditional probabilities p(x|D) need not be revised. The consequence of the
axiom is that non-overlapping domains of x contribute additively to the entropy:
$S[P] = \int dx\, F(P(x), x)$ where F is some unknown function.

Axiom 2: Coordinate invariance. The ranking should not depend on the
system of coordinates.

The coordinates that label the points x are arbitrary; they carry no information.
The consequence of this axiom is that $S[P] = \int dx\, m(x)\,\Phi(P(x)/m(x))$ involves
coordinate invariants such as $dx\, m(x)$ and $P(x)/m(x)$, where the functions m(x)
(which is a density) and $\Phi$ are, at this point, still undetermined.

Next we make a second use of the locality axiom and allow domain D to ex- 
tend over the whole space. Axiom 1 then asserts that when there is no new infor- 
mation there is no reason to change one's mind. When there are no constraints 
the selected posterior distribution should coincide with the prior distribution. 
This eliminates the arbitrariness in the density m(x): up to normalization m(x)
is the prior distribution, $m(x) \propto q(x)$.

In [3] the remaining unknown function $\Phi$ was determined using the following
axiom: 

Old Axiom 3: Subsystem independence. When a system is composed of 
subsystems that are believed to be independent it should not matter whether the 
inference procedure treats them separately or jointly. 

Let us be very explicit about what this axiom means. Consider a system
composed of two subsystems which our prior evidence has led us to believe are
independent. This belief is reflected in the prior distribution: if the subsystem
priors are $q_1(x_1)$ and $q_2(x_2)$, then the prior for the whole system is the
product $q_1(x_1)q_2(x_2)$. Further suppose that new information is acquired such that
$q_1(x_1)$ is updated to $p_1(x_1)$ and that $q_2(x_2)$ is updated to $p_2(x_2)$. Nothing in
this new information requires us to revise our previous assessment of
independence, therefore there is no need to change our minds, and the function $\Phi$ must
be such that the prior for the whole system $q_1(x_1)q_2(x_2)$ should be updated to
$p_1(x_1)p_2(x_2)$.

This idea is implemented as follows: First we treat the two subsystems
separately. Suppose that for subsystem 1 maximizing

$$S_1[P_1, q_1] = \int dx_1\, q_1(x_1)\,\Phi\!\left(\frac{P_1(x_1)}{q_1(x_1)}\right) \tag{1}$$

subject to constraints $C_1$ on the marginal distribution $P_1(x_1) = \int dx_2\, P(x_1, x_2)$
selects the posterior $p_1(x_1)$. The constraints $C_1$ could, for example, include
normalization, or they could involve the known expected value of a function
$f_1(x_1)$,

$$\int dx_1\, f_1(x_1)P_1(x_1) = \int dx_1 dx_2\, f_1(x_1)P(x_1, x_2) = F_1 . \tag{2}$$

Similarly, suppose that for subsystem 2 maximizing the corresponding $S_2[P_2, q_2]$
subject to constraints $C_2$ on $P_2(x_2) = \int dx_1\, P(x_1, x_2)$ selects the posterior
$p_2(x_2)$.

Next we treat the subsystems jointly and maximize the joint entropy,

$$S[P, q_1 q_2] = \int dx_1 dx_2\, q_1(x_1)q_2(x_2)\,\Phi\!\left(\frac{P(x_1, x_2)}{q_1(x_1)q_2(x_2)}\right) , \tag{3}$$

subject to precisely the same constraints on the joint distribution P. The
function $\Phi$ is determined by the requirement that the selected posterior be $p_1 p_2$.
As shown in [3] this leads to the logarithmic form

$$S[P, q] = -\int dx\, P(x)\log\frac{P(x)}{q(x)} . \tag{4}$$
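
As a concrete illustration of maximizing eq. (4) under an expected-value constraint of the type in eq. (2), the following is a minimal sketch on a discrete space (a simplifying assumption; the text works with continuous x). The names `relative_entropy` and `me_update`, the bisection bounds, and the die example are illustrative choices and not part of the original derivation. The sketch relies on the standard result that the constrained maximizer is an exponential tilt of the prior, $p_i \propto q_i e^{\lambda f_i}$.

```python
import math


def relative_entropy(p, q):
    """Discrete analogue of eq. (4): S[P, q] = -sum_i P_i log(P_i / q_i)."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def me_update(q, f, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximize S[P, q] subject to normalization and sum_i f_i P_i = target.

    Assumes the maximizer has the exponential form P_i = q_i exp(lam f_i) / Z
    and that the required lam lies in [lo, hi] (hypothetical defaults).
    """
    def tilted(lam):
        weights = [qi * math.exp(lam * fi) for qi, fi in zip(q, f)]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(fi * pi for fi, pi in zip(f, tilted(lam)))

    # The constrained mean is non-decreasing in lam, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))


# Example: uniform prior over the faces of a die, updated so the mean is 4.5.
faces = [1, 2, 3, 4, 5, 6]
prior = [1.0 / 6.0] * 6
posterior = me_update(prior, faces, 4.5)
unchanged = me_update(prior, faces, 3.5)  # constraint already met by the prior
```

When `target` equals the prior mean the posterior coincides with the prior, in line with the requirement that in the absence of new information there is no reason to update.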



3 The new independence axiom 

Next we replace our old axiom 3 by an axiom which is more convincing
because it is an explicit requirement of consistency.

New Axiom 3: Consistency for independent subsystems. When a system 
is composed of subsystems that are known to be independent it should not matter 
whether the inference procedure treats them separately or jointly. 
Again, we have to be very explicit about what this axiom means and how 
it differs from the old one. When the subsystems are treated separately the 
inference proceeds exactly as described before: for subsystem 1 maximize the 
entropy $S_1[P_1, q_1]$ subject to the constraints $C_1$ to select a posterior $p_1$ and
similarly for subsystem 2 to select $p_2$. The important difference is introduced
when the subsystems are treated jointly. Since we are only concerned with
those special examples where we know that the subsystems are independent,
we are required to search for the posterior within the restricted family of joint
distributions that take the form of a product $P = P_1 P_2$; this is an additional
constraint over and above the original $C_1$ and $C_2$.

In the previous case we chose $\Phi$ so as to maintain independence because
there was no evidence against it. Here we impose independence by hand as an
additional constraint for the stronger reason that the subsystems are known to
be independent. At first sight it appears that the new axiom does not place
as stringent a restriction on the general form of $\Phi$: it would seem that $\Phi$ has
been relieved of its responsibility of enforcing independence because it is up to
us to impose it explicitly by hand. However, as we shall see, the fact that we
seek an entropy S of general applicability and that we require consistency for
all possible independent subsystems is sufficiently restrictive.

The new constraint $P = P_1 P_2$ is easily implemented by direct substitution.
Instead of maximizing the joint entropy, $S[P, q_1 q_2]$, we now maximize

$$S[P_1 P_2, q_1 q_2] = \int dx_1 dx_2\, q_1(x_1)q_2(x_2)\,\Phi\!\left(\frac{P_1(x_1)P_2(x_2)}{q_1(x_1)q_2(x_2)}\right) \tag{5}$$

under independent variations $\delta P_1$ and $\delta P_2$ subject to the same constraints $C_1$
and $C_2$, and we choose $\Phi$ by imposing that the updating leads to the posterior
$p_1(x_1)p_2(x_2)$.



3.1 Consistency for identical independent subsystems 

Here we show that applying the axiom to subsystems that happen to be identical
restricts the entropy functional to a member of the one-parameter family given
by

$$S_\eta[P, q] = -\int dx\, P(x)\left(\frac{P(x)}{q(x)}\right)^{\eta} \quad \text{for} \quad \eta \neq -1, 0 . \tag{6}$$

Since entropies that differ by additive or multiplicative constants are equivalent
in that they induce the same ranking scheme, we could equally well have written

$$S_\eta[P, q] = \frac{1}{\eta(\eta+1)}\left(1 - \int dx\, P^{\eta+1} q^{-\eta}\right) . \tag{7}$$

This is convenient because the entropies for $\eta = 0$ and $\eta = -1$ can be obtained
by taking the appropriate limits. For $\eta \to 0$ use $y^\eta = \exp(\eta \log y) \approx 1 + \eta \log y$
to obtain the usual logarithmic entropy, $S_0[P, q] = S[P, q]$ in eq. (4). Similarly,
for $\eta \to -1$ we get $S_{-1}[P, q] = S[q, P]$.
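
The two limits can be checked numerically. The sketch below evaluates a discrete analogue of eq. (7) close to $\eta = 0$ and $\eta = -1$ and compares it with the logarithmic entropy; the discretization, the function names `s_eta` and `s_log`, and the sample distributions are assumptions made for illustration only.

```python
import math


def s_eta(p, q, eta):
    """Discrete analogue of eq. (7); undefined at eta = 0 and eta = -1."""
    overlap = sum(pi ** (eta + 1.0) * qi ** (-eta) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (eta * (eta + 1.0))


def s_log(p, q):
    """Discrete logarithmic relative entropy S[P, q] = -sum_i P_i log(P_i / q_i)."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Illustrative distributions (arbitrary choices, strictly positive).
p = [0.6, 0.3, 0.1]
q = [0.2, 0.3, 0.5]

eps = 1e-6
near_zero = s_eta(p, q, eps)               # should approach S[P, q]
near_minus_one = s_eta(p, q, -1.0 + eps)   # should approach S[q, P]
```

The sample distributions are chosen so that S[P, q] and S[q, P] differ, which makes the two limits distinguishable.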

The proof below is based upon and extends a previous proof by Karbelkar [5].
He showed that belonging to the family of $\eta$-entropies is a sufficient condition to
satisfy the consistency axiom for identical systems and he conjectured but did 
not prove that this was perhaps also a necessary condition. Although necessity 
was not essential to his argument it is crucial for ours. We show below that for 
identical subsystems there are no acceptable entropies outside this family. 



Proof 



First we treat the subsystems separately. For subsystem 1 we maximize the
entropy $S_1[P_1, q_1]$ subject to normalization and the constraint $C_1$ in eq. (2).
Introduce Lagrange multipliers $\alpha_1$ and $\lambda_1$,

$$\delta\left[S_1[P_1, q_1] - \lambda_1\left(\int dx_1\, f_1 P_1 - F_1\right) - \alpha_1\left(\int dx_1\, P_1 - 1\right)\right] = 0 , \tag{8}$$

which gives

$$\Phi'\!\left(\frac{p_1(x_1)}{q_1(x_1)}\right) = \lambda_1 f_1(x_1) + \alpha_1 , \tag{9}$$

where the prime indicates a derivative with respect to the argument,
$\Phi'(y) = d\Phi(y)/dy$. For subsystem 2 we need only consider the extreme situation where
the constraints $C_2$ determine the posterior completely: $P_2(x_2) = p_2(x_2)$.

Next we treat the subsystems jointly. The constraints $C_2$ are easily
implemented by direct substitution and thus, we maximize the entropy $S[P_1 p_2, q_1 q_2]$
by varying over $P_1$ subject to normalization and the constraint $C_1$ in eq. (2).
Introduce Lagrange multipliers $\alpha$ and $\lambda$,

$$\delta\left[S[P_1 p_2, q_1 q_2] - \lambda\left(\int dx_1\, f_1 P_1 - F_1\right) - \alpha\left(\int dx_1\, P_1 - 1\right)\right] = 0 , \tag{10}$$

which gives

$$\int dx_2\, p_2\, \Phi'\!\left(\frac{p_1 p_2}{q_1 q_2}\right) = \lambda[p_2] f_1(x_1) + \alpha[p_2] , \tag{11}$$

where the multipliers $\lambda$ and $\alpha$ are independent of $x_1$ but could in principle be
functionals of $p_2$.

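The consistency condition that constrains the form of $\Phi$ is that if the solution
to eq. (9) is $p_1(x_1)$ then the solution to eq. (11) must also be $p_1(x_1)$, and this must
be true irrespective of the choice of $p_2(x_2)$. Let us then consider a small change
$p_2 \to p_2 + \delta p_2$ that preserves the normalization of $p_2$. First introduce a Lagrange
multiplier $\alpha_2$ and rewrite eq. (11) as

$$\int dx_2\, p_2\, \Phi'\!\left(\frac{p_1 p_2}{q_1 q_2}\right) - \alpha_2\left[\int dx_2\, p_2 - 1\right] = \lambda[p_2] f_1(x_1) + \alpha[p_2] , \tag{12}$$

where we have replaced $P_1$ by the known solution $p_1$ and thereby effectively
transformed eqs. (9) and (11) into an equation for $\Phi$. The $\delta p_2(x_2)$ variation
gives,

$$\Phi'\!\left(\frac{p_1 p_2}{q_1 q_2}\right) + \frac{p_1 p_2}{q_1 q_2}\,\Phi''\!\left(\frac{p_1 p_2}{q_1 q_2}\right) = \frac{\delta\lambda}{\delta p_2} f_1(x_1) + \frac{\delta\alpha}{\delta p_2} + \alpha_2 . \tag{13}$$

Next use eq. (9) to eliminate $f_1(x_1)$,

$$\Phi'\!\left(\frac{p_1 p_2}{q_1 q_2}\right) + \frac{p_1 p_2}{q_1 q_2}\,\Phi''\!\left(\frac{p_1 p_2}{q_1 q_2}\right) = A\!\left[\frac{p_2}{q_2}\right]\Phi'\!\left(\frac{p_1}{q_1}\right) + B\!\left[\frac{p_2}{q_2}\right] , \tag{14}$$

where

$$A\!\left[\frac{p_2}{q_2}\right] = \frac{1}{\lambda_1}\frac{\delta\lambda}{\delta p_2} \quad \text{and} \quad B\!\left[\frac{p_2}{q_2}\right] = -\frac{\delta\lambda}{\delta p_2}\frac{\alpha_1}{\lambda_1} + \frac{\delta\alpha}{\delta p_2} + \alpha_2 \tag{15}$$

are at this point unknown functionals of $p_2/q_2$. Differentiating eq. (14) with
respect to $x_1$ the B term drops out and we get

$$A\!\left[\frac{p_2}{q_2}\right]\frac{d}{dx_1}\Phi'\!\left(\frac{p_1}{q_1}\right) = \frac{d}{dx_1}\left[\Phi'\!\left(\frac{p_1 p_2}{q_1 q_2}\right) + \frac{p_1 p_2}{q_1 q_2}\,\Phi''\!\left(\frac{p_1 p_2}{q_1 q_2}\right)\right] , \tag{16}$$

which shows that A is not a functional but a mere function of $p_2/q_2$. Substituting
back into eq. (14) we see that the same is true for B. Therefore eq. (14) can be
written as

$$\Phi'(y_1 y_2) + y_1 y_2\, \Phi''(y_1 y_2) = A(y_2)\Phi'(y_1) + B(y_2) , \tag{17}$$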
where $y_1 = p_1/q_1$, $y_2 = p_2/q_2$, and $A(y_2)$, $B(y_2)$ are unknown functions of $y_2$.
If we specialize to identical subsystems for which we can exchange the labels
$1 \leftrightarrow 2$, we get

$$A(y_2)\Phi'(y_1) + B(y_2) = A(y_1)\Phi'(y_2) + B(y_1) . \tag{18}$$

To find the unknown functions A and B differentiate with respect to $y_2$,

$$A'(y_2)\Phi'(y_1) + B'(y_2) = A(y_1)\Phi''(y_2) \tag{19}$$

and then with respect to $y_1$ to get

$$\frac{A'(y_1)}{\Phi''(y_1)} = \frac{A'(y_2)}{\Phi''(y_2)} = a = \text{const} . \tag{20}$$

Integrating,

$$A(y_1) = a\Phi'(y_1) + b . \tag{21}$$

Substituting back into eq. (19) and integrating gives

$$B'(y_2) = b\Phi''(y_2) \quad \text{and} \quad B(y_2) = b\Phi'(y_2) + c , \tag{22}$$

where b and c are constants. We can check that A(y) and B(y) are indeed
solutions of eq. (18). Substituting into eq. (17) gives

$$\Phi'(y_1 y_2) + y_1 y_2\, \Phi''(y_1 y_2) = a\Phi'(y_1)\Phi'(y_2) + b\left[\Phi'(y_1) + \Phi'(y_2)\right] + c . \tag{23}$$

This is a peculiar differential equation. We can think of it as one differential
equation for $\Phi'(y_1)$ for each given constant value of $y_2$ but there is a complication
in that the various (constant) coefficients $\Phi'(y_2)$ are themselves unknown. To
solve for $\Phi$ choose a fixed value of $y_2$, say $y_2 = 1$,

$$y\Phi''(y) - \eta\Phi'(y) - \kappa = 0 , \tag{24}$$

where $\eta = a\Phi'(1) + b - 1$ and $\kappa = b\Phi'(1) + c$. To eliminate the constant $\kappa$
differentiate with respect to y,

$$y\Phi''' + (1 - \eta)\Phi'' = 0 , \tag{25}$$

which is a linear homogeneous equation and is easy to integrate. For a generic
value of $\eta$ the solution is

$$\Phi''(y) \propto y^{\eta - 1} \;\Rightarrow\; \Phi'(y) = \alpha y^{\eta} + \beta . \tag{26}$$

The constants $\alpha$ and $\beta$ are chosen so that this is a solution of eq. (23) for all
values of $y_2$ (and not just for $y_2 = 1$). Substituting into eq. (23) and equating
the coefficients of various powers of $y_1 y_2$, $y_1$, and $y_2$ gives three conditions on
the two constants $\alpha$ and $\beta$,

$$\alpha(1 + \eta) = a\alpha^2 , \quad 0 = a\alpha\beta + b\alpha , \quad \beta = a\beta^2 + 2b\beta + c . \tag{27}$$

The nontrivial ($\alpha \neq 0$) solutions are $\alpha = (1 + \eta)/a$ and $\beta = -b/a$, while the
third equation gives $c = b(b - 1)/a$. We conclude that for generic values of $\eta$
the solution of eq. (23) is

$$\Phi(y) = \frac{1}{a}\, y^{\eta+1} - \frac{b}{a}\, y + C , \tag{28}$$

where C is a new constant. Choosing $a = -\eta(\eta + 1)$ and $b = 1 + Ca$ we obtain
eq. (7).
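
The power-law solution can be checked directly against the functional equation (23). The sketch below is a numerical sanity check under stated assumptions, not part of the proof: it takes $\Phi'(y) = \alpha y^{\eta} + \beta$ with $\alpha = (1+\eta)/a$ and $\beta = -b/a$, obtains c by solving the condition $\beta = a\beta^2 + 2b\beta + c$, and evaluates the difference between the two sides of eq. (23) at a few arbitrary points. All function names and numerical values are illustrative.

```python
def phi_prime(y, a, b, eta):
    """Phi'(y) = alpha y**eta + beta with alpha = (1 + eta)/a, beta = -b/a."""
    return (1.0 + eta) / a * y ** eta - b / a


def phi_double_prime(y, a, b, eta):
    """Derivative of phi_prime with respect to y."""
    return (1.0 + eta) / a * eta * y ** (eta - 1.0)


def c_from_third_condition(a, b):
    """Solve beta = a beta^2 + 2 b beta + c for c, with beta = -b/a."""
    beta = -b / a
    return beta - a * beta ** 2 - 2.0 * b * beta


def residual_eq23(y1, y2, a, b, eta):
    """Left side minus right side of eq. (23) for the trial solution above."""
    c = c_from_third_condition(a, b)
    y = y1 * y2
    lhs = phi_prime(y, a, b, eta) + y * phi_double_prime(y, a, b, eta)
    rhs = (a * phi_prime(y1, a, b, eta) * phi_prime(y2, a, b, eta)
           + b * (phi_prime(y1, a, b, eta) + phi_prime(y2, a, b, eta))
           + c)
    return lhs - rhs


# Arbitrary test points (y1, y2, a, b, eta); illustrative values only.
samples = [
    (0.7, 1.9, 2.0, 0.5, 0.3),
    (1.3, 0.4, -1.5, 2.0, -0.6),
    (2.5, 3.0, 0.8, -1.0, 1.7),
]
worst = max(abs(residual_eq23(*s)) for s in samples)
```

A residual at the level of floating-point rounding for all sample points is consistent with the trial solution holding for every $y_2$ and not only for $y_2 = 1$.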

For the special values $\eta = 0$ and $\eta = -1$ one can either first take the
limit of the differential eq. (25) and then find the relevant solutions, or one can
first solve the differential equation for general $\eta$ and then take the limit of the
solution eq. (7) as described earlier. Either way one obtains (up to additive
and multiplicative constants which have no effect on the ranking scheme) the
entropies $S_0[P, q] = S[P, q]$ and $S_{-1}[P, q] = S[q, P]$.

3.2 Consistency for non-identical subsystems 

Let us summarize our results so far. The goal is to update probabilities by
ranking the distributions according to an entropy S that is of general
applicability. The functional form of the entropy S has been constrained down to a
member of the one-dimensional family $S_\eta$. One might be tempted to conclude
(see [5, 6]) that there is no S of universal applicability; that inferences about
different systems ought to be carried out with different $\eta$-entropies. But we have
not yet exhausted the full power of our new axiom 3.

To proceed further we ask: What is $\eta$? Is it a property of the individual
carrying out the inference or of the system under investigation? The former
makes no sense; we insist that the updating must be objective in that different
individuals with the same prior and the same information must make the same
inference. Therefore the "inference parameter" $\eta$ must be a characteristic of the
system.

Consider two different systems characterized by $\eta_1$ and $\eta_2$. Let us further
suppose that these systems are independent (perhaps system 1 is here on Earth
while the other lives in a distant galaxy) so that they fall under the jurisdiction
of the new axiom 3; inferences about system 1 are carried out with $S_{\eta_1}[P_1, q_1]$
while inferences about system 2 require $S_{\eta_2}[P_2, q_2]$. For the combined system we
are also required to use an $\eta$-entropy $S_\eta[P_1 P_2, q_1 q_2]$. The question is what $\eta$ do
we choose that will lead to consistent inferences whether we treat the systems
separately or jointly. The results of the previous section indicate that a joint
inference with $S_\eta[P_1 P_2, q_1 q_2]$ is equivalent to separate inferences with $S_\eta[P_1, q_1]$
and $S_\eta[P_2, q_2]$. Therefore we must choose $\eta = \eta_1$ and also $\eta = \eta_2$ which is
possible only when $\eta_1 = \eta_2$. But this is not all: any other system whether here
on Earth or elsewhere that happens to be independent of the distant system 2
must also be characterized by the same inference parameter $\eta = \eta_2 = \eta_1$ even if
it is correlated with system 1. Thus all systems have the same $\eta$ whether they
are independent or not.
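
One way to see why a joint inference with a single $\eta$ ranks candidates in the same way as a separate inference is that the integral appearing in eq. (7) factorizes over independent subsystems. The following is a minimal sketch on discrete spaces; the helper names `overlap` and `product` and all numerical values are assumptions introduced for illustration. It checks that, with $p_2$ held fixed, the joint entropy equals $S_\eta[p_2, q_2] + I_\eta[p_2, q_2]\, S_\eta[P_1, q_1]$, where $I_\eta$ denotes the factorizing sum.

```python
def overlap(p, q, eta):
    """I_eta[P, q] = sum_i P_i**(eta+1) q_i**(-eta), the sum appearing in eq. (7)."""
    return sum(pi ** (eta + 1.0) * qi ** (-eta) for pi, qi in zip(p, q))


def s_eta(p, q, eta):
    """Discrete analogue of eq. (7), for eta not equal to 0 or -1."""
    return (1.0 - overlap(p, q, eta)) / (eta * (eta + 1.0))


def product(d1, d2):
    """Joint distribution of two independent discrete subsystems."""
    return [u * v for u in d1 for v in d2]


# Illustrative priors and candidate posteriors for the two subsystems.
q1, p1 = [0.5, 0.5], [0.8, 0.2]
q2, p2 = [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]
eta = 0.7

joint = s_eta(product(p1, p2), product(q1, q2), eta)
# With p2 held fixed, the joint entropy is an increasing affine function of
# the subsystem-1 entropy, so both rank candidate P1's identically.
affine = s_eta(p2, q2, eta) + overlap(p2, q2, eta) * s_eta(p1, q1, eta)
```

Because the coefficient `overlap(p2, q2, eta)` is positive, maximizing the joint entropy over $P_1$ selects the same distribution as maximizing $S_\eta[P_1, q_1]$ alone.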

The power of a consistency argument resides in its universal applicability: if
a general expression for $S[P, q]$ exists then it must be of the form $S_\eta[P, q]$ where
$\eta$ is a universal constant. The remaining problem is to determine this universal
$\eta$. One possibility is to determine $\eta$ experimentally: are there systems for which
inferences based on a known value of $\eta$ have repeatedly led to success? The
answer is yes; they are quite common.

The next step in our argument is provided by the work of Jaynes [4] who
showed that statistical mechanics and thus thermodynamics are theories of
inference based on the value $\eta = 0$. His method, called MaxEnt, can be interpreted
as the special case of the ME when one updates from a uniform prior using the
Gibbs-Shannon entropy. Thus, it is an experimental fact without any known
exceptions that inferences about all physical, chemical and biological systems
that are in thermal equilibrium or close to it can be carried out by assuming
that $\eta = 0$. Let us emphasize that this is not an obscure and rare example
of purely academic interest; these systems comprise essentially all of natural
science. (Included is every instance where it is useful to introduce a notion of
temperature.)

In conclusion: consistency for non-identical systems requires that $\eta$ be a
universal constant and there is abundant experimental evidence for its value being
$\eta = 0$. Other $\eta$-entropies may be useful for other purposes but the
logarithmic entropy $S[P, q]$ in eq. (4) provides the only consistent ranking criterion for
updating probabilities that can claim general applicability.

4 Bayes updating 

The two preeminent updating methods are the ME method discussed above 
and Bayes' rule. The choice between the two methods has traditionally been 
dictated by the nature of the information being processed (either constraints or 
observed data) but questions about their compatibility are regularly raised. Our 
goal here is to show that these two updating strategies are completely consistent 
with each other. Let us start by drawing a distinction between Bayes' theorem 
and Bayes' rule. 

4.1 Bayes' theorem and Bayes' rule 

The goal here is to update our beliefs about the values of one or several quantities
$\theta \in \Theta$ on the basis of observed values of variables $x \in \mathcal{X}$ and of the known
relation between them represented by a specific model. The first important point
to make is that attention must be focused on the joint distribution $P_{\mathrm{old}}(x, \theta)$.
Indeed, being a consequence of the product rule, Bayes' theorem requires that
$P_{\mathrm{old}}(x, \theta)$ be defined and that assertions such as "x and $\theta$" be meaningful; the
relevant space is neither $\mathcal{X}$ nor $\Theta$ but the product $\mathcal{X} \times \Theta$. The label "old"
is important. It has been attached to the joint distribution $P_{\mathrm{old}}(x, \theta)$ because
this distribution codifies our beliefs about x and about $\theta$ before the information
contained in the actual data has been processed. The standard derivation of
Bayes' theorem invokes the product rule,

$$P_{\mathrm{old}}(x, \theta) = P_{\mathrm{old}}(x)P_{\mathrm{old}}(\theta|x) = P_{\mathrm{old}}(\theta)P_{\mathrm{old}}(x|\theta) , \tag{29}$$

so that

$$P_{\mathrm{old}}(\theta|x) = P_{\mathrm{old}}(\theta)\,\frac{P_{\mathrm{old}}(x|\theta)}{P_{\mathrm{old}}(x)} . \qquad \text{(Bayes' theorem)}$$

It is important to realize that at this point there has been no updating. Our 
beliefs have not changed. All we have done is rewrite what we knew all along in 
$P_{\mathrm{old}}(x, \theta)$. Bayes' theorem is an identity that follows from requirements on how
we should consistently assign degrees of belief. Whether the justification of the 
product rule is sought through Cox's consistency requirement and regraduation 
or through a Dutch book betting coherence argument, the theorem is valid 
irrespective of whatever data will be or has been collected. Our notation, with 
the label "old" throughout, makes this point explicit. 

The real updating from the old prior distribution $P_{\mathrm{old}}(\theta)$ to a new posterior
distribution $P_{\mathrm{new}}(\theta)$ occurs when we take into account the values of x that have
actually been observed, which we will denote with a capital X. This requires a
new assumption and the natural choice is that the updated distribution $P_{\mathrm{new}}(\theta)$
be given by Bayes' rule,

$$P_{\mathrm{new}}(\theta) = P_{\mathrm{old}}(\theta|X) . \qquad \text{(Bayes' rule)}$$

Combining Bayes' theorem with Bayes' rule leads to the standard equation for
Bayes updating,

$$P_{\mathrm{new}}(\theta) = P_{\mathrm{old}}(\theta)\,\frac{P_{\mathrm{old}}(X|\theta)}{P_{\mathrm{old}}(X)} . \tag{30}$$

The assumption embodied in Bayes' rule is extremely reasonable: we maintain 
those old beliefs about 9 that are consistent with data values that have turned 
out to be true. Data values that were not observed are discarded because they 
are now known to be false. 
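
A minimal sketch of eq. (30) on a discrete space may help fix ideas. The function name `bayes_update`, the dictionary layout, and the two-hypothesis coin example are hypothetical choices made for illustration.

```python
def bayes_update(prior, likelihood, observed):
    """Return P_new(theta) = P_old(theta) P_old(X | theta) / P_old(X).

    prior:      dict theta -> P_old(theta)
    likelihood: dict theta -> dict x -> P_old(x | theta)
    observed:   the value X that was actually observed
    """
    joint_at_x = {t: prior[t] * likelihood[t][observed] for t in prior}
    evidence = sum(joint_at_x.values())  # P_old(X)
    return {t: w / evidence for t, w in joint_at_x.items()}


# Hypothetical example: two coin hypotheses and a single observed toss.
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {
    "fair": {"H": 0.5, "T": 0.5},
    "biased": {"H": 0.9, "T": 0.1},
}
posterior = bayes_update(prior, likelihood, "H")
```

Only the column of the joint distribution at the observed value enters the result; the columns for unobserved data values are discarded.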

This argument is indeed so compelling that it may seem unnecessary to seek 
any further justification for the Bayes' rule assumption. However, we deal here 
with such a basic algorithm for information processing - it is fundamental to 
all experimental science - that even such a self-evident assumption should be 
carefully examined and its compatibility with the ME method should be verified. 

4.2 Bayes' rule from ME 

Our first concern when using the ME method to update from a prior to a 
posterior distribution is to define the space in which the search for the posterior 
will be conducted. We argued above that the relevant space is the product
$\mathcal{X} \times \Theta$. Therefore the selected posterior $P_{\mathrm{new}}(x, \theta)$ is that which maximizes

$$S[P, P_{\mathrm{old}}] = -\int dx\, d\theta\, P(x, \theta)\log\frac{P(x, \theta)}{P_{\mathrm{old}}(x, \theta)} \tag{31}$$

subject to the appropriate constraints.

Next, the information being processed, the observed data X, must be
expressed in the form of a constraint on the allowed posteriors. Clearly, the family
of posteriors that reflects the fact that x is now known to be X is such that

$$P(x) = \int d\theta\, P(x, \theta) = \delta(x - X) . \tag{32}$$

This amounts to an infinite number of constraints: there is one constraint on
$P(x, \theta)$ for each value of the variable x and each constraint will require its
own Lagrange multiplier $\lambda(x)$. Furthermore, we impose the usual normalization
constraint,

$$\int dx\, d\theta\, P(x, \theta) = 1 . \tag{33}$$

Maximize S subject to these constraints,

$$\delta\left\{S + \int dx\, \lambda(x)\left[\int d\theta\, P(x, \theta) - \delta(x - X)\right] + \alpha\left[\int dx\, d\theta\, P(x, \theta) - 1\right]\right\} = 0 , \tag{34}$$

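and the selected posterior is

$$P_{\mathrm{new}}(x, \theta) = P_{\mathrm{old}}(x, \theta)\,\frac{e^{\lambda(x)}}{Z} , \tag{35}$$

where the normalization Z is

$$Z = e^{-\alpha + 1} = \int dx\, d\theta\, P_{\mathrm{old}}(x, \theta)\, e^{\lambda(x)} , \tag{36}$$

and the multipliers $\lambda(x)$ are determined from eq. (32),

$$\int d\theta\, P_{\mathrm{old}}(x, \theta)\,\frac{e^{\lambda(x)}}{Z} = P_{\mathrm{old}}(x)\,\frac{e^{\lambda(x)}}{Z} = \delta(x - X) . \tag{37}$$

Therefore, substituting back into eq. (35),

$$P_{\mathrm{new}}(x, \theta) = \frac{P_{\mathrm{old}}(x, \theta)\,\delta(x - X)}{P_{\mathrm{old}}(x)} = \delta(x - X)\,P_{\mathrm{old}}(\theta|x) . \tag{38}$$

The new marginal distribution for $\theta$ is

$$P_{\mathrm{new}}(\theta) = \int dx\, P_{\mathrm{new}}(x, \theta) = P_{\mathrm{old}}(\theta|X) , \tag{39}$$

which is Bayes' rule! Bayes updating is a special case of ME updating.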
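
The result can be illustrated numerically on a discrete space, where the delta function becomes a Kronecker delta. Every joint distribution compatible with the data constraint has the form $P(x, \theta) = \delta_{x,X}\, r(\theta)$, and the sketch below compares the relative entropy of the Bayes choice $r(\theta) = P_{\mathrm{old}}(\theta|X)$ with that of randomly drawn alternatives. The discretization, the function names, the random sampling, and the numerical values are assumptions made for illustration; this is a spot check, not a proof.

```python
import math
import random


def entropy_given_data(r, joint_at_x):
    """S[P, P_old] for P(x, theta) = delta_{x,X} r(theta), on a discrete space.

    Only the x = X slice contributes, so
    S = -sum_theta r(theta) log(r(theta) / P_old(X, theta)).
    """
    return -sum(ri * math.log(ri / ji) for ri, ji in zip(r, joint_at_x) if ri > 0.0)


def random_distribution(n, rng):
    """Draw an arbitrary normalized distribution over n values of theta."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical slice P_old(X, theta) of the joint prior at the observed X.
joint_at_x = [0.10, 0.25, 0.05]
evidence = sum(joint_at_x)                   # P_old(X)
bayes = [j / evidence for j in joint_at_x]   # P_old(theta | X)

rng = random.Random(0)
best_alternative = max(
    entropy_given_data(random_distribution(3, rng), joint_at_x) for _ in range(2000)
)
s_bayes = entropy_given_data(bayes, joint_at_x)
```

In this discrete setting the entropy of any admissible candidate equals $\log P_{\mathrm{old}}(X)$ minus a non-negative divergence from the Bayes posterior, so no sampled alternative should exceed `s_bayes`.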

To summarize: the prior $P_{\mathrm{old}}(x, \theta) = P_{\mathrm{old}}(x)P_{\mathrm{old}}(\theta|x)$ is updated to the
posterior $P_{\mathrm{new}}(x, \theta) = P_{\mathrm{new}}(x)P_{\mathrm{new}}(\theta|x)$ where $P_{\mathrm{new}}(x) = \delta(x - X)$ is fixed by
the observed data while $P_{\mathrm{new}}(\theta|x) = P_{\mathrm{old}}(\theta|x)$ remains unchanged. Note that
in accordance with the philosophy that drives the ME method one only updates
those aspects of one's beliefs for which corrective new evidence has been supplied.

The generalization to situations where there is some uncertainty about the
actual data is straightforward. In this case the marginal P(x) in eq. (32) is not a
$\delta$ function but a known distribution $P_D(x)$. The selected posterior
$P_{\mathrm{new}}(x, \theta) = P_{\mathrm{new}}(x)P_{\mathrm{new}}(\theta|x)$ is easily shown to have $P_{\mathrm{new}}(x) = P_D(x)$ with
$P_{\mathrm{new}}(\theta|x) = P_{\mathrm{old}}(\theta|x)$ remaining unchanged. This leads to Jeffrey's conditionalization rule,

$$P_{\mathrm{new}}(\theta) = \int dx\, P_{\mathrm{new}}(x, \theta) = \int dx\, P_D(x)\,P_{\mathrm{old}}(\theta|x) . \tag{40}$$
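
Eq. (40) can be sketched on a discrete space as a weighted mixture of ordinary Bayes posteriors. The function names, the dictionary layout, and the coin example below are hypothetical and serve only to illustrate the rule.

```python
def conditional_theta_given_x(prior, likelihood, x):
    """P_old(theta | x) from P_old(theta) and P_old(x | theta)."""
    joint = {t: prior[t] * likelihood[t][x] for t in prior}
    total = sum(joint.values())
    return {t: w / total for t, w in joint.items()}


def jeffrey_update(prior, likelihood, data_dist):
    """Discrete form of eq. (40): P_new(theta) = sum_x P_D(x) P_old(theta | x)."""
    new = {t: 0.0 for t in prior}
    for x, weight in data_dist.items():
        if weight == 0.0:
            continue
        cond = conditional_theta_given_x(prior, likelihood, x)
        for t in prior:
            new[t] += weight * cond[t]
    return new


# Hypothetical example: the toss was probably heads, but the reading is uncertain.
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {
    "fair": {"H": 0.5, "T": 0.5},
    "biased": {"H": 0.9, "T": 0.1},
}
uncertain = jeffrey_update(prior, likelihood, {"H": 0.8, "T": 0.2})
certain = jeffrey_update(prior, likelihood, {"H": 1.0, "T": 0.0})
```

When all the weight of `data_dist` sits on a single value, as in `certain`, the update reduces to Bayes' rule.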

5 Conclusions



We have shown that Skilling's method of induction has led to a unique general 
theory of inductive inference, the ME method. The whole approach is extremely 
conservative. First, the axioms merely instruct us what not to update - do not 
change your mind except when forced by new information. Second, the validity 
of the method does not depend on any particular interpretation of the notion 
of entropy - entropy needs no interpretation. 

Our derivation of the consequences of the new axiom shows that when applied
to identical subsystems it restricts the entropy to a member of the $\eta$-entropy
family. Its further application to non-identical systems shows that consistency
requires that $\eta$ be a universal constant which must take the value $\eta = 0$ in order
to account for the empirical success of the inference theory we know as statistical
mechanics. Thus, the unique tool for updating probabilities is the logarithmic
relative entropy. Other entropies with $\eta \neq 0$ or those of Renyi or Tsallis are
ruled out; they may be useful for other purposes but not for inference.

Finally we explored the compatibility of Bayes and ME updating. After 
pointing out the distinction between Bayes' theorem and the Bayes' updating 
rule, we showed that Bayes' rule is a special case of ME updating by translating 
information in the form of data into constraints that can be processed using 
ME. 